monitoring: add consolidated workloads CPU/memory dashboard #209

Closed
ejahnGithub wants to merge 5 commits into sigstore:main from ejahnGithub:workloads-cpu-memory-dashboard

@ejahnGithub ejahnGithub commented May 5, 2026

Summary

Adds a single GCP Monitoring dashboard, "Workloads CPU & Memory", that consolidates CPU and memory metrics across all Sigstore GKE workloads (grouped by namespace / container_name), so oncallers do not have to navigate multiple metric pages while investigating resource issues.

It includes (per namespace / container):

  • CPU usage (cores) + Memory used (bytes)
  • CPU & Memory limit utilization
  • CPU & Memory request utilization
  • Container restarts (delta 5m)
  • Node CPU & Memory allocatable utilization
  • Ephemeral storage used
  • Pod network RX / TX
  • Running containers per namespace

Testing

  • JSON syntax validated
  • terraform fmt -recursive clean
  • Tested locally by importing the dashboard JSON into staging via the Cloud Console UI
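As a rough illustration of the "JSON syntax validated" step, a check along these lines could be run before importing the file. This is a minimal sketch, not the check used in this PR: the function name `check_dashboard_json` is hypothetical, and the structural checks assume the dashboard uses `displayName` and a `mosaicLayout` with a `tiles` list.

```python
import json


def check_dashboard_json(text: str) -> list[str]:
    """Return a list of problems found in a dashboard JSON document (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}, column {exc.colno}"]

    if not isinstance(doc, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    if not doc.get("displayName"):
        problems.append("missing displayName")

    # Assumption: the dashboard uses a mosaic layout with a non-empty tile list.
    layout = doc.get("mosaicLayout")
    tiles = layout.get("tiles") if isinstance(layout, dict) else None
    if not isinstance(tiles, list) or not tiles:
        problems.append("mosaicLayout.tiles is missing or empty")
    return problems
```

A syntax check like this only catches malformed JSON and missing top-level keys; schema errors are still only surfaced when the dashboard is actually created.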

Will roll out to staging first via the usual Sigstore CI flow once merged.

Issue

Resolves sigstore/public-good-instance#1122

@ejahnGithub ejahnGithub requested a review from a team as a code owner May 5, 2026 19:26
@ejahnGithub ejahnGithub marked this pull request as draft May 5, 2026 19:34
Eugene Jahn and others added 3 commits May 7, 2026 10:39
Adds a single GCP Monitoring dashboard that surfaces CPU and memory
across all Sigstore GKE workloads (grouped by namespace / container),
so oncall does not have to navigate multiple metric pages while
investigating resource issues.

The dashboard includes:
  - CPU usage in cores (rate of core_usage_time)
  - Memory used (non-evictable bytes)
  - CPU/memory limit utilization (REDUCE_MAX so a hot replica is visible)
  - CPU/memory request utilization (REDUCE_MAX)
  - Container restart deltas
  - Node CPU allocatable utilization

Resolves sigstore/public-good-instance#1122

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
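The tiles listed in the commit message above could be generated roughly as follows. This is a sketch under assumptions: the helper `xy_tile`, the 60s alignment period, the filter strings, and the choice of `REDUCE_SUM` for the usage tile are illustrative and are not taken from the PR's JSON. The metric types are standard GKE system metrics, and `REDUCE_MAX` for utilization follows the commit message.

```python
# Group-by labels matching the "namespace / container" grouping in the description.
GROUP_BY = ["resource.label.namespace_name", "resource.label.container_name"]


def xy_tile(title, metric_type, aligner, reducer, period="60s"):
    """Build one xyChart widget for a k8s_container metric (hypothetical helper)."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [{
                "plotType": "LINE",
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        "filter": f'metric.type="{metric_type}" resource.type="k8s_container"',
                        "aggregation": {
                            "alignmentPeriod": period,
                            "perSeriesAligner": aligner,
                            "crossSeriesReducer": reducer,
                            "groupByFields": GROUP_BY,
                        },
                    }
                },
            }]
        },
    }


# core_usage_time is cumulative CPU seconds, so its rate is usage in cores.
cpu_usage = xy_tile(
    "CPU usage (cores)",
    "kubernetes.io/container/cpu/core_usage_time",
    "ALIGN_RATE",
    "REDUCE_SUM",  # assumption: total cores per namespace/container
)

# REDUCE_MAX keeps the hottest replica visible instead of averaging it away.
cpu_limit = xy_tile(
    "CPU limit utilization",
    "kubernetes.io/container/cpu/limit_utilization",
    "ALIGN_MEAN",
    "REDUCE_MAX",
)
```

With a mean reducer, one replica at 95% of its limit among several idle replicas would plot well below 95%. Taking the maximum reports the worst replica in each group.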

The xyChart threshold schema does not accept color/direction for these
chart types; the dashboard create rejects them. Keep just the value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
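The threshold fix described above amounts to dropping every threshold field except `value`. A minimal sketch of that transformation, assuming the dashboard is a mosaic layout whose tiles hold `xyChart` widgets (the function name `strip_threshold_styling` is hypothetical):

```python
import copy


def strip_threshold_styling(dashboard: dict) -> dict:
    """Return a copy of the dashboard whose xyChart thresholds keep only `value`."""
    cleaned = copy.deepcopy(dashboard)
    for tile in cleaned.get("mosaicLayout", {}).get("tiles", []):
        chart = tile.get("widget", {}).get("xyChart")
        # Leave non-xyChart widgets and charts without thresholds untouched.
        if not chart or "thresholds" not in chart:
            continue
        chart["thresholds"] = [
            {"value": t["value"]} for t in chart["thresholds"] if "value" in t
        ]
    return cleaned
```

The function works on a deep copy, so the input document is left unchanged and the two versions can be diffed.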

Heights of 16 in a 12-column mosaic produced very tall narrow tiles.
Use h=4 (standard) for charts and keep h=4 for the overview banner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
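To illustrate the tile geometry in the commit message above, the sketch below places widgets on a 12-column mosaic with h=4. The tile width of 6 (two charts per row) and the helper name `layout_tiles` are assumptions for illustration. The PR does not state the widths it uses.

```python
def layout_tiles(widgets, columns=12, width=6, height=4):
    """Place widgets left to right, wrapping to a new row when a row is full."""
    per_row = columns // width
    tiles = []
    for i, widget in enumerate(widgets):
        row, col = divmod(i, per_row)
        tiles.append({
            "xPos": col * width,
            "yPos": row * height,
            "width": width,
            "height": height,
            "widget": widget,
        })
    return {"columns": columns, "tiles": tiles}


layout = layout_tiles([{"title": "CPU"}, {"title": "Memory"}, {"title": "Restarts"}])
positions = [(t["xPos"], t["yPos"]) for t in layout["tiles"]]
# positions == [(0, 0), (6, 0), (0, 4)]
```

With height 16 and the same width, each tile would be four times as tall as here, which matches the "very tall narrow tiles" the commit describes.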
@ejahnGithub ejahnGithub force-pushed the workloads-cpu-memory-dashboard branch from e5d92ad to 939aeed Compare May 7, 2026 14:39
Eugene Jahn and others added 2 commits May 7, 2026 10:46
…iles to workloads dashboard

Mirrors the standard GKE Workloads dashboard so oncall does not have
to navigate to multiple pages to find resource usage charts:

  - Pod network received / sent (per namespace)
  - Ephemeral storage used (per container)
  - Node memory allocatable utilization (sibling of node CPU)
  - Running containers per namespace (uptime count)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>

Previous tile used ALIGN_COUNT + REDUCE_SUM, which sums sample counts
within the alignment window and is an approximation of container
count. Switch to ALIGN_MEAN per series + REDUCE_COUNT across series
so the y-axis is the exact number of running containers per
namespace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
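The difference between the two aggregations in the commit message above can be seen in a small simulation. This is a toy model, not the Monitoring API: the namespaces, containers, and uptime samples are made up, and each series is keyed by (namespace, container) for simplicity, whereas real series are per running container instance.

```python
from statistics import mean

# Hypothetical uptime samples inside one alignment window.
# Three containers, five samples each.
window = {
    ("ns-a", "server"): [100.0, 160.0, 220.0, 280.0, 340.0],
    ("ns-a", "sidecar"): [90.0, 150.0, 210.0, 270.0, 330.0],
    ("ns-b", "worker"): [10.0, 70.0, 130.0, 190.0, 250.0],
}


def per_namespace(aligner, reducer):
    """Align each series to one value, then reduce across series per namespace."""
    aligned = {key: aligner(samples) for key, samples in window.items()}
    grouped = {}
    for (namespace, _container), value in aligned.items():
        grouped.setdefault(namespace, []).append(value)
    return {ns: reducer(values) for ns, values in grouped.items()}


old = per_namespace(len, sum)   # ALIGN_COUNT + REDUCE_SUM: total samples
new = per_namespace(mean, len)  # ALIGN_MEAN + REDUCE_COUNT: number of series

# old == {"ns-a": 10, "ns-b": 5}
# new == {"ns-a": 2, "ns-b": 1}
```

The old result scales with the number of samples per window, so it equals the container count only when each series contributes exactly one sample per alignment period. Counting series after alignment does not depend on the sampling rate.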
@ejahnGithub ejahnGithub marked this pull request as ready for review May 7, 2026 14:58
@ejahnGithub ejahnGithub closed this by deleting the head repository May 7, 2026